{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# AutoML for Text Classification\n",
    "\n",
    "## Learning Objectives\n",
    "\n",
    "1. Learn how to create a text classification dataset for AutoML using BigQuery\n",
    "1. Learn how to train AutoML to build a text classification model\n",
    "1. Learn how to evaluate a model trained with AutoML\n",
    "1. Learn how to predict on new test data with AutoML\n",
    "\n",
    "## Introduction\n",
    "\n",
    "In this notebook, we will use [AutoML for Text Classification](https://cloud.google.com/natural-language/automl/docs/beginners-guide) to train a text model to recognize the source of article titles:  New York Times, TechCrunch or GitHub. \n",
    "\n",
    "In a first step, we will query a public dataset on BigQuery taken from [hacker news](https://news.ycombinator.com/) ( it is an aggregator that displays tech related headlines from various  sources) to create our training set.\n",
    "\n",
    "In a second step, use the AutoML UI to upload our dataset, train a text model on it, and evaluate the model we have just trained.\n",
    "\n",
    "Each learning objective will correspond to a __#TODO__ in this student lab notebook -- try to complete this notebook first and then review the [solution notebook](../solutions/automl_for_text_classification.ipynb)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "\n",
    "from google.cloud import bigquery\n",
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%load_ext google.cloud.bigquery"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Replace the variable values in the cell below:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "PROJECT = \"cloud-training-demos\"  # Replace with your PROJECT\n",
    "BUCKET = PROJECT  # defaults to PROJECT\n",
    "REGION = \"us-central1\"  # Replace with your REGION\n",
    "\n",
    "os.environ['PROJECT'] = PROJECT\n",
    "os.environ['BUCKET'] = BUCKET\n",
    "os.environ['REGION'] = REGION"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!gcloud storage buckets create gs://$BUCKET"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Create a Dataset from BigQuery \n",
    "\n",
    "Hacker news headlines are available as a BigQuery public dataset. The [dataset](https://console.cloud.google.com/bigquery?p=bigquery-public-data&d=hacker_news&page=table&t=stories) contains all headlines from the sites inception in October 2006 until October 2015. \n",
    "\n",
    "### Lab Task 1a: \n",
    "Complete the query below to create a sample dataset containing the `url`, `title`, and `score` of articles from the public dataset `bigquery-public-data.hacker_news.stories`. Use a WHERE clause to restrict to only those articles with\n",
    "* title length greater than 10 characters\n",
    "* score greater than 10\n",
    "* url length greater than 0 characters"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%bigquery --project $PROJECT\n",
    "\n",
    "SELECT\n",
    "    # TODO: Your code goes here.\n",
    "FROM\n",
    "    # TODO: Your code goes here.\n",
    "WHERE\n",
    "    # TODO: Your code goes here.\n",
    "    # TODO: Your code goes here.\n",
    "    # TODO: Your code goes here.\n",
    "LIMIT 10"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's do some regular expression parsing in BigQuery to get the source of the newspaper article from the URL. For example, if the url is http://mobile.nytimes.com/...., I want to be left with <i>nytimes</i>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Lab task 1b:\n",
    "Complete the query below to count the number of titles within each 'source' category. Note that to grab the 'source' of the article we use the a regex command on the `url` of the article. To count the number of articles you'll use a `GROUP BY` in sql, and we'll also restrict our attention to only those articles whose title has greater than 10 characters."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%bigquery --project $PROJECT\n",
    "\n",
    "SELECT\n",
    "    ARRAY_REVERSE(SPLIT(REGEXP_EXTRACT(url, '.*://(.[^/]+)/'), '.'))[OFFSET(1)] AS source,\n",
    "    # TODO: Your code goes here.\n",
    "FROM\n",
    "    `bigquery-public-data.hacker_news.stories`\n",
    "WHERE\n",
    "    REGEXP_CONTAINS(REGEXP_EXTRACT(url, '.*://(.[^/]+)/'), '.com$')\n",
    "    # TODO: Your code goes here.\n",
    "GROUP BY\n",
    "    # TODO: Your code goes here.\n",
    "ORDER BY num_articles DESC\n",
    "  LIMIT 100"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now that we have good parsing of the URL to get the source, let's put together a dataset of source and titles. This will be our labeled dataset for machine learning."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "regex = '.*://(.[^/]+)/'\n",
    "\n",
    "\n",
    "sub_query = \"\"\"\n",
    "SELECT\n",
    "    title,\n",
    "    ARRAY_REVERSE(SPLIT(REGEXP_EXTRACT(url, '{0}'), '.'))[OFFSET(1)] AS source\n",
    "    \n",
    "FROM\n",
    "    `bigquery-public-data.hacker_news.stories`\n",
    "WHERE\n",
    "    REGEXP_CONTAINS(REGEXP_EXTRACT(url, '{0}'), '.com$')\n",
    "    AND LENGTH(title) > 10\n",
    "\"\"\".format(regex)\n",
    "\n",
    "\n",
    "query = \"\"\"\n",
    "SELECT \n",
    "    LOWER(REGEXP_REPLACE(title, '[^a-zA-Z0-9 $.-]', ' ')) AS title,\n",
    "    source\n",
    "FROM\n",
    "  ({sub_query})\n",
    "WHERE (source = 'github' OR source = 'nytimes' OR source = 'techcrunch')\n",
    "\"\"\".format(sub_query=sub_query)\n",
    "\n",
    "print(query)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For ML training, we usually need to split our dataset into training and evaluation datasets (and perhaps an independent test dataset if we are going to do model or feature selection based on the evaluation dataset). AutoML however figures out on its own how to create these splits, so we won't need to do that here. \n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "bq = bigquery.Client(project=PROJECT)\n",
    "title_dataset = bq.query(query).to_dataframe()\n",
    "title_dataset.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "AutoML for text classification requires that\n",
    "* the dataset be in csv form with \n",
    "* the first column being the texts to classify or a GCS path to the text \n",
    "* the last column to be the text labels\n",
    "\n",
    "The dataset we pulled from BigQuery satisfies these requirements."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"The full dataset contains {n} titles\".format(n=len(title_dataset)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's make sure we have roughly the same number of labels for each of our three labels:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "title_dataset.source.value_counts()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Finally we will save our data, which is currently in-memory, to disk.\n",
    "\n",
    "We will create a csv file containing the full dataset and another containing only 500 articles for development.\n",
    "\n",
    "**Note:** It may take a long time to train AutoML on the full dataset, so we recommend to use the sample dataset for the purpose of learning the tool. \n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "DATADIR = './data/'\n",
    "\n",
    "if not os.path.exists(DATADIR):\n",
    "    os.makedirs(DATADIR)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "FULL_DATASET_NAME = 'titles_full.csv'\n",
    "FULL_DATASET_PATH = os.path.join(DATADIR, FULL_DATASET_NAME)\n",
    "\n",
    "# Let's shuffle the data before writing it to disk.\n",
    "title_dataset = title_dataset.sample(n=len(title_dataset))\n",
    "\n",
    "title_dataset.to_csv(\n",
    "    FULL_DATASET_PATH, header=False, index=False, encoding='utf-8')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now let's sample 500 articles from the full dataset and make sure we have enough examples for each label in our sample dataset (see [here](https://cloud.google.com/natural-language/automl/docs/beginners-guide) for further details on how to prepare data for AutoML)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Lab Task 1c:\n",
    "Use `.sample` to create a sample dataset of 1,000 articles from the full dataset. Use `.value_counts` to see how many articles are contained in each of the three source categories?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sample_title_dataset = # TODO: Your code goes here.\n",
    "# TODO: Your code goes here."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's write the sample dataset to disk."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "SAMPLE_DATASET_NAME = 'titles_sample.csv'\n",
    "SAMPLE_DATASET_PATH = os.path.join(DATADIR, SAMPLE_DATASET_NAME)\n",
    "\n",
    "sample_title_dataset.to_csv(\n",
    "    SAMPLE_DATASET_PATH, header=False, index=False, encoding='utf-8')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sample_title_dataset.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%bash\n",
    "gcloud storage cp data/titles_sample.csv gs://$BUCKET"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Train a Model with AutoML for Text Classification"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Lab Task 2:\n",
    "Complete Steps 1-3 below to train a text classification model using AutoML."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Step 1: Launch AutoML"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Go to the GCP console, and click on the Natural Language service in the console menu.\n",
    "Click on **ENABLE API** if the API is not enabled.\n",
    "\n",
    "Then click on **Get started** in the **AutoML Text & Document Classification** tile."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![AutoML](./assets/get_started.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Step 2: Create a Dataset"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Select **New Dataset**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![AutoML](./assets/new_dataset_1.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Then\n",
    "\n",
    "1. Give the new dataset a name\n",
    "1. Choose **Multi-label classification**\n",
    "1. Hit **Create Dataset**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![AutoML](./assets/crate_a_dataset_1.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Then\n",
    "\n",
    "1. In 'Select files to import' section, choose **Select a CSV file on Cloud Storage**.\n",
    "1. Click on **Browse** and upload `titles_sample.csv` we created above.\n",
    "1. Hit **Import**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![AutoML](./assets/select_files_to_import.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This step may take a while. You should see the following screen:"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![wait](./assets/processing_text_items_1.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 3: Train a AutoML text model"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "When the dataset has been imported, you can inspect it in the AutoML UI, and get statistics about the label distribution. If you are happy with what you see, proceed to train a text model from this dataset:"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![AutoML](./assets/items.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Then\n",
    "\n",
    "1. Switch to **Train** tab.\n",
    "1. Click on **START TRAINING** and confirm again by clicking on **START TRAINING**."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The training step may take a few hours, while AutoML is searching for the best model to crush this dataset. \n",
    "\n",
    "You should see the following screen:"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![AutoML](./assets/train_dataset_1.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**NOTE: It takes more than 2 hours to complete the training. Please wait till the training get completed. If your training takes more time than lab time, please only review the next sections.**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Lab Task 3: \n",
    "Complete Step 4 below to evaluate the AutoML model."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 4: Evaluate the model"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Once the model is trained, click on **Evaluate** to understand how the model performed. You'll be able to see the overall precision and recall, as well as drill down to performances at the individual label level.\n",
    "\n",
    "AutoML UI will also show you examples where the model made a mistake for each of the labels."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![AutoML](./assets/evaluate_1.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Lab Task 4:\n",
    "Complete Step 5 below to call prediction on your AutoML text classification model."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 5: Predict with the trained AutoML model"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now you can test your model directly by entering new text in the UI and having AutoML predicts the source of your snippet:"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![AutoML](./assets/test_use_1.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Copyright 2022 Google Inc. Licensed under the Apache License, Version 2.0 (the \"License\"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License"
   ]
  }
 ],
 "metadata": {
  "environment": {
   "name": "tf2-gpu.2-1.m68",
   "type": "gcloud",
   "uri": "gcr.io/deeplearning-platform-release/tf2-gpu.2-1:m68"
  },
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
